Tutorial 1: Cell Sentence Conversion & Reconstruction¶

In this tutorial, we will explore one of the key concepts in the Cell2Sentence (C2S) framework: the transformation of single-cell RNA sequencing data into what we call 'cell sentences'. These cell sentences are sequences of gene names ordered by their expression levels within each cell. This transformation allows us to leverage Large Language Models (LLMs) for single-cell biology in a flexible way, and take advantage of their natural language capabilities and large-scale pretraining.

We will also demonstrate how to reconstruct the original gene expression vectors from these cell sentences using a simple linear model. This process is important for validating that the information encoded in the cell sentences can be effectively decoded back into its original form without losing too much information.

  • Dataset Information:
    • Source: Immune System tissue dataset from Domínguez Conde et al. (2022)
    • Link: Science Article
    • Citation: Domínguez Conde, C., et al. "Cross-tissue immune cell analysis reveals tissue-specific features in humans." Science 376.6594 (2022): eabl5197.
  • Subset Used: Two donors (A29 and A31), totaling 29,773 cells.
  • Preprocessing: The data was preprocessed in Tutorial 0 (Data Preparation). We will start with this preprocessed data in this tutorial.

First, we will import the necessary libraries. These include general-purpose Python libraries, as well as specialized libraries for single-cell RNA sequencing data analysis.

In [1]:
######## Load modules ########
from __future__ import annotations #default now for name.error issue
import os
import numpy as np
import random
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
import scipy
from tqdm import tqdm

# Cell2Sentence imports
import cell2sentence as cs
from cell2sentence.utils import benchmark_expression_conversion, reconstruct_expression_from_cell_sentence

# Single-cell libraries
import scanpy as sc
import anndata as ad
from collections import Counter #count table

sc.set_figure_params(dpi=300, color_map="viridis_r", facecolor="white", )
sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
/ix/ccdg/storage3/til177/custom_miniconda/envs/cell2sentence/lib/python3.8/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
scanpy==1.9.8 anndata==0.9.2 umap==0.5.7 numpy==1.24.4 scipy==1.10.1 pandas==2.0.3 scikit-learn==1.3.2 statsmodels==0.14.1 pynndescent==0.5.13

Load Data and Double-check¶

Next, we will load the preprocessed dataset from the previous tutorial. This dataset has already been filtered and normalized, so it it ready for transformation into cell sentences.

Please make sure you have completed the preprocessing steps in Tutorial 0 before running the following code, if you are using your own dataset.. Ensure that the file path is correctly set in DATA_PATH to where your preprocessed data was saved from tutorial 0.

In [3]:
DATA_PATH = "/ix/ccdg/storage3/til177/ParkLab/Project_C2S_scFM/code/tutorials/dominguez_conde_immune_tissue_two_donors_preprocessed_tutorial_0_PYFILE.h5ad"
In [4]:
adata = ad.read_h5ad(DATA_PATH)
adata #AnnData object with n_obs × n_vars = 29773 × 23944
Out[4]:
AnnData object with n_obs × n_vars = 29773 × 23944
    obs: 'cell_type', 'tissue', 'batch_condition', 'organism', 'assay', 'sex', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
    var: 'gene_name', 'ensembl_id', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
    uns: 'batch_condition_colors', 'cell_type_colors', 'log1p', 'neighbors', 'pca', 'tissue_colors', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'
In [9]:
adata.obs.head()
Out[9]:
cell_type tissue batch_condition organism assay sex n_genes n_genes_by_counts total_counts total_counts_mt pct_counts_mt
Pan_T7935490_AAACCTGCAAATTGCC CD4-positive helper T cell ileum A29 Homo sapiens 10x 5' v1 female 2191 2191 6542.0 327.0 4.998472
Pan_T7935490_AAACGGGCATCTGGTA CD8-positive, alpha-beta memory T cell ileum A29 Homo sapiens 10x 5' v1 female 2046 2046 5871.0 429.0 7.307103
Pan_T7935490_AAACGGGTCTTGCATT CD8-positive, alpha-beta memory T cell ileum A29 Homo sapiens 10x 5' v1 female 2129 2129 7248.0 337.0 4.649559
Pan_T7935490_AAAGCAATCATCGCTC CD8-positive, alpha-beta memory T cell ileum A29 Homo sapiens 10x 5' v1 female 1262 1262 3927.0 305.0 7.766743
Pan_T7935490_AAAGTAGCAGTCACTA gamma-delta T cell ileum A29 Homo sapiens 10x 5' v1 female 2248 2248 6574.0 1083.0 16.473988
In [10]:
adata.var.head()
Out[10]:
gene_name ensembl_id n_cells mt n_cells_by_counts mean_counts pct_dropout_by_counts total_counts
RP11-34P13 RP11-34P13 ENSG00000238009 38 False 38 0.001310 99.872368 39.0
RP11-34P13-3 RP11-34P13 ENSG00000241860 106 False 106 0.003627 99.643973 108.0
AP006222 AP006222 ENSG00000286448 7 False 7 0.000235 99.976489 7.0
LINC01409 LINC01409 ENSG00000237491 1292 False 1292 0.045981 95.660498 1369.0
FAM87B FAM87B ENSG00000177757 3 False 3 0.000101 99.989924 3.0
In [10]:
sc.pl.umap(
    adata,
    color="cell_type",
    size=8,
    title="Human Immune Tissue UMAP",
)
/ix/ccdg/storage3/til177/custom_miniconda/envs/cell2sentence/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
No description has been provided for this image
In [7]:
adata.X.max()
Out[7]:
3.408124

We are expecting log10 base 10 transformed data, with a maximum value somewhere around 3 or 4. Make sure to start with processed and normalized data when doing the cell sentence conversion!

Cell2Sentence Conversion¶

Now that we have preprocessed and normalized data loaded, we will perform the conversion to cell sentences. In this section, we will transform our AnnData object containing our single-cell dataset into a Cell2Sentence (C2S) dataset by calling the functions of the CSData class in the C2S code base. Full documentation for the functions of the CSData class can be found in the documentation page of C2S.

First, we define which columns in adata.obs we would like to keep in our C2S dataset. Cell type, tissue, donor, organism, and sex information will be useful to keep, so we will define a list with these labels:

In [12]:
print(adata.obs.columns)
adata_obs_cols_to_keep = ["cell_type", "tissue", "batch_condition", "organism", "sex"]
Index(['cell_type', 'tissue', 'batch_condition', 'organism', 'assay', 'sex',
       'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt',
       'pct_counts_mt'],
      dtype='object')

Now, we create a CSData object using the adata_to_arrow() class function of the CSData model class. This will return us a Huggingface PyArrow dataset (see https://huggingface.co/docs/datasets/en/about_arrow)

In [13]:
# Create CSData object
arrow_ds, vocabulary = cs.CSData.adata_to_arrow(
    adata=adata, 
    random_state=SEED, 
    sentence_delimiter=' ',
    label_col_names=adata_obs_cols_to_keep
)
100%|██████████| 29773/29773 [00:13<00:00, 2240.86it/s]

Let's examine the arrow dataset which was created:

In [25]:
print(arrow_ds)
print(arrow_ds.shape)
Dataset({
    features: ['cell_name', 'cell_sentence', 'cell_type', 'tissue', 'batch_condition', 'organism', 'sex'],
    num_rows: 29773
})
(29773, 7)

We can see that our 29,773 cells have now been converted into rows of a Dataset object. The metadata columns of our adata object have been preserved, and two new columns have been added: cell_name and cell_sentence. These columns contain unique cell identifiers as well as cell sentences, respectively. Each cell sentence consists of a string of space-separated gene names, in order of descending expression value. For more details about the cell sentence creation process, please refer to the C2S paper.

We can look at one arrow dataset example as follows:

In [26]:
sample_idx = 0
arrow_ds[sample_idx]
Out[26]:
{'cell_name': 'Pan_T7935490_AAACCTGCAAATTGCC',
 'cell_sentence': 'RPLP1 ACTB EEF1A1 HSP90AA1 TMSB4X B2M FTH1 KLF6 HSPA1B MALAT1 RPS12 HSPA8 RPL13 MT-CO1 ATF3 MT-CO2 RPL41 TPT1 MT-CO3 RPS19 HLA-B RPL10 RPS4X RPL28 MT-CYB DUSP1 RPL30 MT-ND4L RPS15 FOS RPL34 RPS2 RPLP2 MT-ND3 RPS18 RPS8 TRBV7-2 RPL32 RPS3 ANXA1 RPL11 HLA-C RPS27 ACTG1 UBC RPL3 RPL37 RPLP0 MT-ATP6 JUNB RPS28 RPL18 UBB MT-ATP8 RPS14 RPL39 PFN1 GAPDH HSPA1A RPL18A SRGN RPS27A RPL26 RPL19 RPS15A HLA-A DNAJB1 RPS3A CREM RPS13 MT-ND1 RPL21 RPS25 BTG2 RPL35A FAU RPL8 RPL7A RPS24 RPS6 RPS16 RACK1 NFKBIA RGS1 RPL29 CALM1 RPL9 RPL37A MT-ND5 TNFAIP3 RPS23 IL7R RPL36A PTMA NFKBIZ UBA52 EIF1 CRIP1 CORO1A RPL14 HSP90AB1 RPL10A CXCR4 RPL4 EEF1B2 RPL36 RPS9 RPL27 NACA VIM H3-3B RPS7 HSPH1 ATP5F1E HLA-E RPL17 RPSA MYL12A RPL12 CD69 TAGAP RPL35 RPS29 RPL6 SARAF ZFP36L2 MT-ND4 ARHGDIB BTG1 RPS21 EEF1D PNRC1 EEF1G HSPA5 FYB1 CD3E IFITM1 RNASEK EEF2 MT-ND2 FTL S100A4 JUN IFITM2 CYTIP OST4 LAPTM5 RPL36AL PLAAT4 PFDN5 SAMSN1 DNAJA1 EIF4A1 FXYD5 HSPE1 CTLA4 IL32 RPL24 CFL1 SAT1 RPL13A NPM1 NDUFV2 SRSF7 ITM2B POLR1D CCNI CALR ZFP36 COX7C PTPRC GADD45A CD3G RAC2 MYL6 UBE2D3 CCL20 RPS5 RPL22 TOMM7 RPL15 C12ORF57 LSP1 SON H4C3 RPL23A RPL31 DDX5 EVL AHR SERF2 STK17A CD6 PTGER4 OSTF1 OAZ1 TMA7 PHLDA1 COMMD6 PPIA TMSB10 RHOH TUBA1B KTN1 PDIA3 DUSP5 BTF3 SSR4 ATP5MC2 S100A6 CEBPD TRAV13-2 STAT1 HSPB1 PSMB9 RPL38 HNRNPA2B1 FLT3LG CSTB CDK2AP2 CSK MIF MCL1 PSME1 TXNIP RHOA HNRNPA1 UBL5 PTGES3 PDCD4 SLC2A3 ERCC1 SNHG29 NFKBID SRP14 ARPC1B NR4A2 CMTM6 CD44 ITGA4 GNB2 TRAV19 BHLHE40 HMGB1 ARPC3 TRIR PRKCH NME2 KLRB1 CLIC1 ANAPC16 NCOR1 COX5B MYL12B SAP18 RHOB TCF25 SOD1 SRSF9 FOSB PCBP1 CD96 TBC1D10C CD164 CD3D ACTR2 COX4I1 SH3BGRL3 KMT2E HSPD1 PABPC1 SPCS2 SERP1 TNFRSF1B TAGLN2 PCBP2 RPL5 LDLR YWHAB CSRNP1 COX8A MYADM TRMT112 ATP5MG GABARAP ELF1 PRR13 RPL22L1 WAKMAR2 SIT1 GADD45B RPS17 S100A10 RBBP4 HNRNPDL CD2 UQCRQ COX6B1 RGS2 CHCHD2 TAF1B RPL27A TAP1 MTDH CD99 ARF1 TBCA TMEM14B HNRNPA0 RPS11 SEC61A1 RSRP1 TUBB ACIN1 HNRNPUL1 OXNAD1 LMO4 CTSH RBM8A RGL4 NFU1 AP1G2 TSC22D3 CASS4 CCNH RPS26 PIK3R1 SH3KBP1 SSBP4 TPR GADD45G ZFAS1 COTL1 BCL7C MZT2A LAT FUS SDF2L1 KRTCAP2 MRPL41 CENPX SETD5 TMC8 COX17 DNAJB4 CDKN1B ODC1 ACAP1 ANXA6 CYBA SUMO3 TMBIM6 TMEM259 CD37 PRDX2 UBE2D2 BTG3 PARK7 CD247 TMEM258 HNRNPAB SPOCK2 SEPTIN1 LAMTOR5 RNF11 SNX5 SASH3 PPM1G ARF6 EIF5A IL4I1 RBM3 RSF1 SLA CALM3 SNRPB SHISA5 ARID5B KXD1 PTPRCAP GBP1 STN1 NSA2 ATP1A1 LCK PSMB8 PSMB3 HNRNPC RNF167 RESF1 ACTR3 CKLF CD52 RPS27L GSTK1 GIMAP7 DCXR RGS14 VASP PBXIP1 GOLGA4 RPL23 YWHAQ SNHG16 TAOK3 IGBP1 DOCK10 SRRM1 PSMC3 IKZF3 AHI1 RPL7 TMEM87A EIF5 MED1 ANKRD12 IWS1 SLC25A6 TUBA1C LRRC41 SEC11C DLST VMA21 MKNK2 RALB THUMPD1 DBF4 RHEB WASHC3 VRK2 SUPT4H1 FADD CHD1 CAPN1 TMEM179B CKS2 RGS10 EIF4E BPTF RNMT DDX39A GADD45GIP1 DYNLL1 SF3B2 CDKN2AIP RAP1B ACTN4 ABRACL MOB4 METTL23 ITSN2 MTF2 VSIR PER1 COX7A2 COX14 IPCEF1 YWHAH CYRIB MOB3A SP2 BCL11B STUB1 RNF213 PRKRA EDF1 PRPF38B RHOG BAG1 ATOX1 CACYBP MAP4K1 PPP1R2 PAXX ADA NDUFA12 WAC ELK3 PRRC2C MUS81 CCT4 ARL5B RALBP1 FBXW11 LCP1 LEPROTL1 PDCD10 HNRNPR NR3C1 PPP4C BATF ABLIM1 PSMA2 TRAT1 RUNX3 CIB1 ARPC2 ABI3 UGP2 NDUFAB1 HCLS1 DERL1 RAB5A INSIG1 CNBP CHROMR DSTYK DPP4 TOMM6 PDE4D LUC7L2 JUND ARHGDIA CEMIP2 POLE4 ANXA11 PLEKHB2 SDCBP MANF FAM83D IMP3 TMCO1 SUCLG1 MBD2 NDUFA10 PDIA6 S100A11 DNPH1 TARDBP PTDSS1 RP11-25K19 LPXN FYTTD1 RORA NASP ATP6V1F RIPK2 IDI1 RASSF7 MT-ND6 RWDD1 CYB5R3 MAZ PIM1 PSMA3 GPX4 PACS1 TNIP1 PLEKHF1 STXBP2 MSMO1 UQCR11 EPB41L4A-AS1 TMIGD2 PATL1 COX6A1 CCND3 TUBB4B OCIAD2 UBE2V2 ASMTL MAP3K8 FERMT3 SH2B1 ACP5 RPS6KB2 GABPB1-AS1 MTRNR2L12 ZFAND2A EIF3D ATP5PB NEDD8 NDUFA1 C19ORF25 CYB561D2_ENSG00000114395 PPP1R18 SYF2 PIGT SLC25A3 KHDRBS1 GIMAP4 NAXE HMGN1 RGS19 SUMO1 YWHAZ ERN1 LRRFIP1 NCF1 GALNT11 PSMG2 SNRPA1 IFNGR1 ENSA LAMTOR4 TBRG1 LINC-PINT PSMA7 SH2D2A HCST GUSB SEC61B TAX1BP1 RNF149 GNG5 PIP5K1A SIGIRR ATP6V1H RXYLT1 TRAPPC2B STK4 GPR174 CLEC2B CDC5L MLEC SQSTM1 GNAS DAXX ADAR NRIP1 TPI1 ITM2C NOSIP HADHA GTF2B ATP5IF1 ATP6V0B VEZF1 MYBL1 TOMM5 UQCRFS1 PDE4B ELL MRPL18 TRAC PIGBOS1 PPIB MDH2 PMVK PGAM1 PEX2 RPA2 FBXW5 MICOS13 CYBC1 ELOB PSMB4 COX6C GTF2I CHMP4A ALOX5AP SSR3 SELENOF AIP UXT TBCB SKAP1 RBM4 NDUFA9 DNAJB11 ATP5MK REX1BD GATA3 MVP GIMAP5 DNAJC2 SHKBP1 CAP1 SSBP1 SUZ12 SNHG6 LTB HECA WDR83OS DUT ANXA2 RRAGA PPP1R12A TAPBP DOK2 DYNLT3 RBPJ ATP6V1G1 CD40LG OCIAD1 TCEA1 ATM SLC44A2 OGA HNRNPK PPP2R5E NDUFA4 PLAUR NT5C3A RBL2 PHTF2 ZFAND5 CCR6 BCL3 GYPC SPTY2D1 MBP AKAP9 CD83 HNRNPF RAC1 SSR2 SMARCC2 MRPS21 SRSF3 DDX21 NDUFC1 BST2 FSD1 TRIB2 SF3B6 GDI2 HLTF APBB1IP LY6E NDUFB4 NAA38 FKBP1A PSMB10 FDPS MBNL1 UQCRB TMED2 H1-10 RO60 CEP57 CDKL3 CAPNS1 GIMAP1 EZR HOXB4 JAGN1 EIF2S2 DNAJB9 KCTD20 ARGLU1 EWSR1 SH3GL1 EIF3F RABGGTA MEA1 MLLT3 SARNP UBAC2 TRA2A ARL6IP1 GPI GUK1 DPH3 TGFBR2 ASB2 SRP9 GNAI2 RAB5IF PPIH EIF4A2 UBQLN1 GPR65 RASAL3 NDUFB6 RPN1 HNRNPA3 GBF1 OGG1 LCP2 DESI2 DUSP2 MGST3 CCDC18-AS1 CD53 POLM RTRAF ME2 PHAX ATP6AP1 RDH11 TCERG1 ACTR8 SLC1A5 DSE SLC25A39 FDFT1 CIAO2A LATS1 WDR33 CCDC107 MARCHF7 FAM174C LRPAP1 COG6 IL10RA MNAT1 CLIC3 TUSC2 NAT9 FAM13B EPG5 FKBP11 CREB3 ARHGAP45 MRPL9 MGA MIDEAS PHF20 CHD9 RCE1 VDAC2 EEF1E1 SSB GBP5 NMRK1 CCT8 SCAP ACER3 VAMP8 TALDO1 PHGR1 TC2N GRID2IP EIF2AK1 TOLLIP TBC1D10A C17ORF100 NFRKB CCDC137 E4F1 XRCC5 FCMR PGPEP1 DGCR6L CHURC1 OFD1 MAP4 PLEKHA2 AMD1 MORF4L1 IER5 DCTN2 CTDSP1 DDX6 AHCYL1 PNKD ELOVL1 FXR1 SAPCD1 SELENOT UBE2L6 PMM2 RNF113A PGRMC2 CPSF6 ZNF267 MRPL50 IL16 CREBL2 ASXL2 NUDCD3 ZNF445 C3AR1 CCR5 DGKE RUNX1 AKAP11 RP11-562I5 ARIH1 PPP1CB NDUFAF3 MRPS5 ACTR1B TPD52 SF3A2 CLTC TMEM33 NDUFA5 HMGXB4 PPP1R21 RC3H2 ZNF580 PRKCSH GZMM SEC13 CTNNA1 VPS18 SYNGAP1-AS1 MAPRE1 ZC3H8 DSTN MAN2B2 MGAT4A PAK2 CRY1 LSM7 ZC3H6 TMEM230 CPQ PSMD3 TIMM17A TYMP ST6GALNAC6 PHB2 KDELR2 LAMP2 NUMB RALY ADAM8 R3HDM4 DNAAF2 ADPRM CCSER2 THRAP3 KIDINS220 RBM15B DEF6 ADNP POLR1E IQGAP1 POFUT2 EIF3K NOL7 NMUR1 NOM1 GNPDA1 PKM CWC25 IGF2R ATP2B4 GPS1 LRP10 CA10 ERI3 SMAD3 ANP32B COX7A2L EHBP1 CHUK H2AZ1 LPIN2 ASAH1 MRPL33 BCL2 DNAJC1 USO1 SUMF2 CHCHD3 NAA20 TMEM60 IER3IP1 FLII YBX1 LAIR1 MESD TSC22D1 SF1 TMEM273 PLCG1 NMD3 TMEM18 ABHD14A SLFN5 ENTR1 CHMP6 IRF2BP2 CHCHD5 HTATSF1 RPS10 LSM14A SUPT7L PEX26 YJU2 HDDC3 CFDP1 SELPLG ENO1 KDM5A MTLN ZNF706 ERVK3-1 DCP1A DBNDD2 TAX1BP3 SPINK4 RNF31 NEPRO SHFL PSMD8 IL17RA CHST14 MPC1 GLOD4 BICRAL SMIM26 KIF5C PRKAR1A TENT5C NPRL2 TTC9C RPUSD3 SSH1 ARHGAP35 RSBN1 ECHDC2 G3BP2 PRR5 SMG9 SUPT3H ATP2A3 SIN3B LSM12 PTEN ZNF93 NONO SRRM2 ITFG1 C21ORF91 HINT1 GOT1 C1QTNF1 AZIN1 MZT1 LAP3 NDUFB8 NDUFA13 FRMD4B LYRM2 IL21R PARP12 SCMH1 PBRM1 HAUS4 RAB4B FLJ40194 HIKESHI NR1D1 MFSD8 SRSF1 SMIM30 XRN2 SAMD4B SEM1 CAPZB PIAS4 IKZF1 EIF3I AGTPBP1 MZT2B C1ORF122 MUTYH WDR61 DDX39B CCNDBP1 ZNF816 PFDN2 POLR2F ANO6 NDUFS6 CTNS ATP11B SIVA1 EIF4E3 CTSC SMC4 CDC42SE2 FAM131C PLPP6 TRPC4AP MTMR14 GABARAPL1 FAM133B RFNG EIF2B1 LSM1 ISY1 EIF3G ORAI2 FAM53B SEPTIN2 HTRA2 SEPTIN6 KAT7 MBD4 CAPZA2 WDR20 NDUFB7 TWF2 COMMD5 GPR68 SPPL3 TMEM59 NCL PJA2 SDHA SLMAP CBX3 TMEM161B-AS1 PPOX ACBD3 GOLGA7 GTF2H1 MAP3K2 COA6 PHF5A COPS7A NAA10 SNX14 CUL1 ALG13 TNIK STARD3 LINC00892 SNAI3 RABAC1 USP42 NDUFA3 DDX58 PSMC6 RBCK1 HNRNPH3 ZFAND6 SDHD CHCHD10 PUM1 LDLRAD4 TCHP CNOT11 KIF1C SEC61G TACC1 PSMD6 NIPBL DAPP1 RAB2A TRIM5 LMNB2 SUPT6H AP3S1 CLK1 NSUN2 NUP88 PPA1 UMAD1 SETD9 DCAF15 ANXA2R TMEM9B EGLN2 MRPS18B PSD4 GTDC1 SUCO BUD31 IER2 UPF3A PINX1_ENSG00000254093 NUDC TAF7 UTP15 PTPMT1 PSME2 MALT1 PSTPIP1 TSPYL1 CTSD HIGD2A SBF1 LINC00667 FEM1A RAB5C TAF2 SHOC2 RASL11A AMZ2 TRIM65 ATP6V0E1 HEBP1 PSMD1 NDUFAF8 CORO1B MRPL34 UBN1 RAE1 SPINT2 SOCS1 TRIM52-AS1 TFDP1 TES YIPF2 STX5 BZW1 PDCL3 NOLC1 TPM3 INTS11 RAB35 COQ10B ZNF480 C1D THAP4 EBAG9 FOXP1 SLC30A6 NUS1 MRPS24 PPP2CA USP4 C9ORF16 MSH2 UQCRC1 SLC38A1 ACO2 WASHC5 TXLNG PHF20L1 RAB6A SERINC1 YTHDC1 MIF4GD C1GALT1 KRR1 NDFIP1 GMFG DCAF8 SNX6 PAIP2 PLAA SRP19 RAD21 TECR OTUB1 CLK3 BNIP2 NEK9 COPE RAB9A RMC1 JOSD2 MAP3K1 RNF144B ADGRE5 UBA5 NLRC3 LYPLAL1 SURF6 PDE9A ZNF720 C17ORF107 SPPL2A OPTN LACTB DNAJA2 CAMK4 PLP2 ARMC6 IDH2 FUNDC2 SNHG3 CAPG TMEM106C DR1 PGGHG TUT7 CYB561D1 EIF3E FURIN VPS26B APBB1 NTAN1 ATP5MF MAX MRPL14 ATP5ME ORAI1 SEC62 TNFRSF1A HTT EIF2AK4 SCAPER POLB FAM204A RRP12 ABHD5 IMPDH1 TAF1D TDG IMP4 SMC6 TRIM28 FAS ILK PPP1R16B TSR3 NOP56 EIF5B CUTA GLO1 PPP3CB CENPK MTRR ITGAE GOLGA3 GFER EPSTI1 THEMIS TRIAP1 HSD17B7 TSTD3 ANKRD39 PIP4K2A MTFP1 FAM222B SRFBP1 HOXB2 TRAPPC5 SRSF8 HECTD1 CCAR1 XKR8 RBM26 ISCA1 TRBC2 UBE2B TIMM8B RPUSD1 SAT2 FAM102A ST6GAL1 METTL9 C2ORF68 PERP SNU13 ANKZF1 CBX5 GPN1 COMMD8 EGLN1 HERPUD1 SENP6 SLC38A2 AARS1 GNL2 G3BP1 GPR108 PSMA4 SP100 MFSD10 TMEM70 QRSL1 DAD1 RPS19BP1 SNHG1 CSDE1 TAF4B ASF1A N4BP2L2 RNF138 FBXL5 JAK1 NCLN RAB11B PPP1CA ITGB2 KDELR1 PSMB1 MTX2 SLC35E2A NDUFS8 HAGH PIGW IVNS1ABP SDHC POLDIP2 EIF4G1 ANKRD36 SEPTIN7 CSNK1D PSMF1 TM9SF4 GAB3 MYL6B HNRNPL WIPF1 GATAD1 RFFL PRKCQ-AS1 GTPBP10 INO80D IFITM3 GAS5 WASHC1 CTNNAL1 PTGER2 USP33 ARRB2 ARMCX6 GMIP UMPS TRIM22 DDX41 SFPQ RBM25 MRPL38 CNP FAM193B MAP3K11 WDR74 PIGC CDC37 PPIP5K2 TRMO AGTRAP JMJD8 ARID5A ALKBH7 CYTH1 SET PAF1 SNHG5 FAM174B ADI1 SUDS3 CDCA7 GTSF1 MTCH1 TGIF2 MLX PYGO2 EIF2S1 DNPEP IL23R RPA1 CDC37L1 GYG1 ANP32A SRSF2 RNASEH2B WDR44 ZPR1 GIPC1 RRM2B ILVBL ARL2BP BRK1 YIF1A CEP164 DPY30 NUDT7 PPP2R1A R3HDM1 MSN PSIP1 HMGN4 CCDC152 STOML2 EEF1AKMT2 H2AZ2 CHST12 CHFR FKBP15 SPNS1 APH1A NR4A3 CNOT9 GPSM3 TESK1 CDK5RAP1 TNFAIP1 FAM118B ZBTB7A PCED1B-AS1 MRPS18A BUB3 ICMT DDIT3 DNM1L DCTN4 PPIG CCDC91 LZIC CHD2 BIRC2 AHSA1 ARAP2 COMMD10 NUCKS1 ASAP1 DCTPP1 ETFDH NAA80 STAU2 CCNL1 CASP3 CCDC174 PRNP APOA1-AS PPP1R11 SRP68 C1ORF43 RASSF1 MYD88 AKAP13 SLC25A5 ERH UBE2I ATP5PD ORC3 NUP37 TTC31 WBP2 PDE7A KPNB1 ETFA STK25 PJVK COX19 ATP5MC3 PPP2R2A MDH1 PSMB6 GTF2E2 B3GNT2 TSC22D4 GPD2 MBIP ZFR JTB GPRIN3 LPIN1 SLC35A1 NXT1 KLHL22 DYNC1I2 AP1S1 PPP1R15B SMIM15 TMEM65 HAUS3 KLHL28 MRPL20 MVK ATR MIB2 YPEL3 SCOC WASHC2C SPRY1 LCLAT1 CCDC25 SLC43A3 CIAO3 WBP11 TRIM4 C17ORF49 DTX1 VAMP5 ZSCAN16-AS1 TNFSF13B ARHGAP15 CISH DCP2 GRPEL2 PHB AASDHPPT PUM2 ISG20 PFKM SLC2A1 NBL1 TMED4 COPS3 SRSF11 ZNF710 POMP TTC14 ZNF414 PISD TMEM219 MIR23AHG COQ4 OXA1L DNAJA3 KMT2A MAGOH LYPLA2 WSB1 POLR2B AKIRIN2 PSMB2 UIMC1 DDX18 TCOF1 TNFRSF14 STARD3NL CELF2 NABP1 PTP4A2 LRRC59 TMEM167B CAMK2G SLC4A1AP EIPR1 NUCB2 MED6 VEZT SP4 ARL3 USP12 ARHGAP9 VMAC FIP1L1 RP11-316M20 ADH5 TMC6 IRF1-AS1 LIG4 SNX1 MACF1 GNL1 TPM4 RMND5A LPAR6 DNAJB2 NOP53 CDK2AP1 FIS1 PITPNM1 SF3B4 SNAP47 S100PBP C4ORF33 CHST11 RGS18 C7ORF31 PSMD4 EFCAB7 TTC39C ANXA5 KRI1 FAM214B DUSP12 NR2C1 CBLB UNG LSM8 MYO9B GLMN RALA GTF3C6 ISCA2 DNTTIP2 SELENOK MRPS36 KPNA3 THRA PTS DUSP6 VTA1 PPARG SEPTIN9 UBE2N FRY BCAS2 CAMKMT SAMHD1 GBP2 TMEM50A STAT5A IL17A PCYT2 ISG15 MAPK3 PDCD4-AS1 TIAL1 RTF1 CSNK2B RBM17 LPCAT4 ST13 CNBD2 TMEM184B RNPC3 ARCN1 SMAP2 BLOC1S1 NAMPT ZNF597 EMG1 SLC66A3 HDLBP C11ORF58 SH3BP1 SMARCB1 RP11-726G1 ZNF564 IFI27L2 ATG3 C6ORF226 REL GIMAP6 OLFM2 OGFR BAZ1A RAB7A EIF3L TBC1D20 SDHAF2 PARP10 GAK FYN TBXAS1 PRDM1 FAM241A TMEM263 SRPK1 CD2BP2 SERGEF SURF4 MTERF4 PPP1R10 IRF9 USP14 ZFAND2B AK3 CYTH3 PEBP1 PLIN2 NARF ICAM2 ZRSR2 HELZ C14ORF119 DCAF5 RBM42 ITFG2-AS1_ENSG00000258325 ZNF524 CRNKL1 FASTK UBXN1 MYO1G QTRT1 FLOT1 DNASE2 STK40 USP47 RFLNB ZFYVE28 FHL1 TATDN2 SLC25A25 ATP6V1B2 GSAP MRPL10 PSMD11 ZNF821 RANBP9 TIPRL TMED10 EIF3M HMGN2 UHMK1 TRBV30 EMP3 HPS6 TTC17 OSBPL8 COMT ERP29 GRPEL1 ZNF595 ROGDI RAD23A LYSMD2 ANKRD44 FRMD8 SRSF10 BCL2A1 TBC1D31 ALDH9A1 CHMP1A GABARAPL2 RPS20 MRPS2 CAST TMEM200A H4C2 ATRAID CRYBG1 CRY2 OSBPL7 TRAPPC1 POLR2K TESMIN UBE2D1 GNAI3 SLC39A6 HMGCS1 TMEM138 TMEM223 CHP1 ATP5MJ MSRA SQLE ATRX ZNF207 PPP1R15A DNAJC3 SNX3 FBXO9 FAM162A BTF3L4 NKIRAS1 PRDX6 EHMT1 SYS1 USP36 STX4 TMED7 NR4A1 GNA13 LINC00299 ST3GAL5 IDH3A NAA16 SLC38A10 CRYBB2 B4GALT1 METTL26 MRPL54 TRAF3IP2 DTX3L NDUFA6 NDUFAF4 SUCLA2 CALHM2 GTF2IRD2B RP11-706O15 IK MAGI3 TCTA PBDC1 BRD7 IRF1 P2RY10 CTBS CUX1 PRKD2 PGK1 BCLAF1 IRAG2 CIAPIN1 TNFRSF25 SNRNP200 IL2RG PSMA6 UQCRC2 RTF2 SLC35C2 GNG2 ZBTB22 MEF2A NOP10 LY9 MAP2K2 COMMD1 TMEM203 VPS29 SOD2 CITED2 FLI1 FMNL1 MRTO4 RETREG2 FGFR1OP2 RNF41 UTP4 TKT DHX36 ID2 TMEM245 ERP44 ESF1 BAG3 NDEL1 DAP3 HM13 PTGR2 ZFAND3 UCP2 BIN2 MCF2L2 PHKB SUB1 PDE3B CEP250-AS1 RNH1 WDR1 FAM117A CAMTA2 UGCG LINC00294 PHF14 PTPN22 KATNBL1 NT5DC1 TRIM38 ZNF780A C19ORF53 SNRPG TDP2 EIF4H TMEM140 ZCCHC17 ANKLE2 CDS2 EPHB6 CCDC90B ATAD3C COL18A1 FBP1 AHCY AFTPH ZNF432 HSPA9 CMSS1 FGD3 ANXA7 SLC25A13 RNF5 KRT10 KDM2B-DT ATP5F1D SCAMP3 KIF2A SHMT1 MRPL57 CD5 MLH1 ACOX3 ZNF593 TTI2 TRAPPC10 SRP72 CXCR6 COPS5 TET2 HEBP2 OGT ANKRD13D CEBPZ MRPL23 UBE2V1 AP2M1 SLC39A4 LLGL1 DLD ACAA2 SCAMP4 PARP8 CXXC1 MAD2L2 GIT2 OSGIN2 FAM219A ERLEC1 PTPN7 TMEM41B SERTAD3 TCF7 ARPC5 NCBP2AS2 ZNF211 SNRPB2 B9D2 ANKRD27 TIPARP RAB1B SRM PARP16 EIF4EBP2 GDI1 ARPC5L CEACAM21 IVD EXOSC4 KDM6A STARD7 JAK3 CBR1 ARL6IP5 STK26 UFM1 ARHGAP18 ESD ISOC2 HCFC1 RING1 PRELID1 MGME1 NDUFC2 CELF1 PNP IFT27 KBTBD2 TUFM TSTD1 CISD3 C17ORF67 BUD23 SCRN2 CHCHD7 FARSA RP11-29H23-1 LINC02481 TPP1 THYN1 ARHGEF1 GCLM PRPF39 TRAPPC6B SLF1 SMIM3 EVI2A AZIN2 ASH1L API5 COG2 SLC25A22 CHD4 WRAP73 CDC26 UBE2L3 EPC1 ATP2B1 CCAR2 RRN3 UBA1 CSNK1A1 CASP8 PELI1 CNOT2 SRP54 VAPA MRPS33 NDUFB11 NAF1 UBE2G2 DVL3 TLE5 CALCOCO2 UQCR10 SIRPG COX7B NDRG1 EIF2S3 H3C4 SMCHD1 RAN MAPKAPK5-AS1 TSG101 HAX1 ANKDD1A DCTN3 CD55 PSKH1 PET100 IMMT SRGAP2C H3-3A PIGH RBM48 FAM177A1 TEFM FBXL4 MID1IP1 PRR5L SLC25A11 NGDN NCAPD3 SLC9A3R1 CDIP1 GRK6 SDF4 DENND2D CASP7 EFR3A NFAT5 PRELID3B TEX30 CIRBP CSNK1G1 MCRS1 CHMP2B TMOD3 PRKX POLR2C NSFL1C POLR1H TAP2 PIANP ARL6IP6 RIPK1 TDP1 SCAF11 SNRPN ZNF335 USP3 UBQLN2 ARMCX3 HSDL1 RSU1 TXNL1 TMEM101 BAD KIF5B AP1AR FAM104A GBP4 AC058791 NDUFB1 DPP7 NDUFB2 RANGAP1 SETX MAF1 NME4 TRGV10 TACC3 AP5Z1 PITPNB EDEM1 CTD-2095E4 ANKRD28 MINDY2 YPEL4 IGHV4-34 TRIM44 ARL6IP4 GART RAD18 SF3B5 BZW2 ECD ZYX HECW2-AS1 CNTRL RECQL PARP9 UPP1 PYCR2 ARHGAP30 TBC1D22B RASGRP1 STIM1 ZNF263 ZNF230 LDLRAP1 PDCD2 FOXO1 UNC119B RNF114 YPEL5 IFNGR2 PRRC1 EIF4B PDXK RAB8B RICTOR ARL8A ESCO1 RBM38 PPIF GMPS HDDC2 FBXO42 PTP4A3 SERPINH1 TRIM69 WTAP PEMT VPS72 WAS PPP4R3A CDV3 CETN2 VAC14 TRIM21 DAZAP2 ARF5',
 'cell_type': 'CD4-positive helper T cell',
 'tissue': 'ileum',
 'batch_condition': 'A29',
 'organism': 'Homo sapiens',
 'sex': 'female'}

When we print out an entire sample, we can see that it is a Python dictionary. The cell sentence contains a sentence of gene names ordered by descending expression level, giving a rank-based gene name representation of the cell. The rest of the columns of adata.obs which were specified also show up in the dataset sample.

This dataset format will allow us to work with cell sentence datasets in an efficient manner. For more details on the cell sentence transformation, please review the Cell2Sentence paper: https://openreview.net/pdf?id=EWt5wsEdvc

In [27]:
len(arrow_ds[sample_idx]["cell_sentence"].split(" "))  # Cell 0 has 2191 nonzero expressed genes, yielding a sentence of 2191 gene names separated by spaces.
Out[27]:
2191

Next, we will examine the vocabulary which was generated:

In [28]:
print(type(vocabulary))
print(len(vocabulary))
<class 'collections.OrderedDict'>
23944

We can see that vocabulary is an OrderedDict of gene features, corresponding to the original 23944 genes in our adata object. The OrderedDict denotes the gene features present in our single-cell dataset, and also stores the number of cells that gene was expressed in.

In [15]:
list(vocabulary.items())[:10]
Out[15]:
[('RP11-34P13', 38),
 ('RP11-34P13-3', 106),
 ('AP006222', 7),
 ('LINC01409', 1292),
 ('FAM87B', 3),
 ('LINC01128', 1849),
 ('LINC00115', 562),
 ('FAM41C', 225),
 ('LINC02593', 4),
 ('SAMD11', 11)]

CSData creation¶

Now that our AnnData object is converted into an arrow dataset, we can create a CSData object to wrap around our arrow dataset. This will help us manage the arrow dataset, keeping it saved on disk and out of memory until we need the data for inference or finetuning.

In [29]:
c2s_save_dir = "/ix/ccdg/storage3/til177/ParkLab/Project_C2S_scFM/code/tutorials"  # C2S dataset will be saved into this directory
c2s_save_name = "dominguez_immune_tissue_tutorial1"  # This will be the name of our C2S dataset on disk
In [30]:
csdata = cs.CSData.csdata_from_arrow(
    arrow_dataset=arrow_ds, 
    vocabulary=vocabulary,
    save_dir=c2s_save_dir,
    save_name=c2s_save_name,
    dataset_backend="arrow"
)
Saving the dataset (1/1 shards): 100%|██████████| 29773/29773 [00:00<00:00, 48815.89 examples/s] 
In [31]:
print(csdata)
CSData Object; Path=/ix/ccdg/storage3/til177/ParkLab/Project_C2S_scFM/code/tutorials/dominguez_immune_tissue_tutorial1, Format=arrow

The csdata object simply saves our arrow dataset onto disk and keeps a reference to the path. This wrapper class will work in concert with other classes such as CSModel and task functions to load the dataset whenever necessary, so that we avoid holding the C2S dataset in memory when it is not necessary.

We can retrieve and view cell sentences by calling the get_sentence_strings() function:

In [32]:
cell_sentences_list = csdata.get_sentence_strings()
In [37]:
print(len(cell_sentences_list))
# print(cell_sentences_list[1])
29773
In [38]:
def print_first_N_genes(cell_sentence_str: str, top_k_genes: int, delimiter: str = " "):
    """Helper function to print K genes of a cell sentence."""
    print(delimiter.join(cell_sentence_str.split(delimiter)[:top_k_genes]))
In [39]:
print_first_N_genes(cell_sentences_list[0], top_k_genes=100)
RPLP1 ACTB EEF1A1 HSP90AA1 TMSB4X B2M FTH1 KLF6 HSPA1B MALAT1 RPS12 HSPA8 RPL13 MT-CO1 ATF3 MT-CO2 RPL41 TPT1 MT-CO3 RPS19 HLA-B RPL10 RPS4X RPL28 MT-CYB DUSP1 RPL30 MT-ND4L RPS15 FOS RPL34 RPS2 RPLP2 MT-ND3 RPS18 RPS8 TRBV7-2 RPL32 RPS3 ANXA1 RPL11 HLA-C RPS27 ACTG1 UBC RPL3 RPL37 RPLP0 MT-ATP6 JUNB RPS28 RPL18 UBB MT-ATP8 RPS14 RPL39 PFN1 GAPDH HSPA1A RPL18A SRGN RPS27A RPL26 RPL19 RPS15A HLA-A DNAJB1 RPS3A CREM RPS13 MT-ND1 RPL21 RPS25 BTG2 RPL35A FAU RPL8 RPL7A RPS24 RPS6 RPS16 RACK1 NFKBIA RGS1 RPL29 CALM1 RPL9 RPL37A MT-ND5 TNFAIP3 RPS23 IL7R RPL36A PTMA NFKBIZ UBA52 EIF1 CRIP1 CORO1A RPL14
In [40]:
print_first_N_genes(cell_sentences_list[1], top_k_genes=100)
TMSB4X MT-CO1 B2M MT-CO2 MALAT1 HSPA1A ACTB HSP90AA1 RPLP1 MT-CO3 MT-CYB RPS27 JUNB MT-ND3 EEF1A1 HSPA1B HLA-B RPL41 RPS18 CD69 DNAJB1 JUN CCL5 RPL10 RPL28 FOS MT-ND4L MT-ATP8 RPS19 IL32 TPT1 RPL39 IFNG RPL13 FTH1 PPP1R15A RPS15A RPS14 PFN1 ZFP36 HLA-C CD52 EIF1 SH3BGRL3 PTMA MT-ATP6 RPS12 FTL ACTG1 NR4A2 RPS3 RPS15 GAPDH RPLP2 RPL11 HLA-A RPS4X TNFAIP3 RPL34 RPL18 RPL32 RPL37 RPL30 MT-ND1 RPL8 RPL19 RPS23 RPL18A MT-ND2 GZMA UBC NKG7 RPL21 RPS21 RPS27A RPS28 RPL35A GADD45B H3-3B RPS24 RPL3 RPL29 RPS6 VIM SELENOK RPLP0 S100A4 RPL26 RPS13 CD8A MTRNR2L12 RPS25 CD3E CD3D RPS7 FAU CFL1 RPL12 DUSP1 RPS8
In [41]:
print_first_N_genes(cell_sentences_list[2], top_k_genes=100)
TMSB4X EEF1A1 RPLP1 B2M RPL41 RPL10 HSPA1A ACTB MALAT1 HSP90AA1 RPL32 FOS DNAJB1 RPS27 RPS12 RPS19 MT-CO1 TPT1 MT-CO2 RPS18 RPL28 RPL34 CCL5 RPL13 RPL37 RPS6 RPS15A HLA-B RPS3 RPS4X RPS2 RPL11 MT-ND3 RPS14 MT-CO3 RPL39 RPS3A RPL30 MT-CYB RPL21 RPLP0 HSPA1B RPL18A DUSP1 RPL26 RPS23 RPL19 RPS13 HSPA8 VIM RPL3 RPS15 RPS21 FTH1 RPL29 RPL12 RPS27A RPL8 BTG1 RPS28 HLA-C RPLP2 MT-ND4L RPS7 RPL18 RPL7A EIF1 RPS8 RPL15 XCL1 SRGN RPS25 FAU SH3BGRL3 RPL17 MT-ATP8 UBC RPS29 RPL36 RACK1 RPL23A RPL6 ACTG1 IFITM1 JUN RPL13A PFN1 RPL14 RPL36A PTMA RPL37A RPL9 MT-ND2 RPS24 FOSB KLF6 IFITM2 MT-ND5 HLA-A HLA-E

Cell Sentence Transformation Benchmarking¶

We have successfully converted our single-cell dataset into cell sentences using the conversion functions, however it would be useful to know how well the conversion did, and how much expression information was lost when we switched to a rank ordering of genes rather than exact expression values.

In the C2S paper, a strong linear relationship was found between the log of the rank of a gene and its normalized expression value. We can similarly examine our rank transformation and reconstruction ability of the original expression by calling a rank transformation benchmarking utility function. This function will:

  1. Fit a linear model on the ranks and expression of the original data, which can be used to reconstruct expression from rank
  2. Save plots of log rank vs log expression and log expression vs reconstructed expression from rank

First, we define a path where the plots for the benchmarking and reconstruction will be saved:

In [42]:
output_path = os.path.join(c2s_save_dir, c2s_save_name)
output_path
Out[42]:
'/ix/ccdg/storage3/til177/ParkLab/Project_C2S_scFM/code/tutorials/dominguez_immune_tissue_tutorial1'
In [44]:
transformation_benchmarking_save_name = "inverse_transformation_testing_tutorial_1"

We can call the benchmarking function with our output directory, as well as the normalized expression of our AnnData object. To avoid benchmarking on too many data points, we set a sample_size of cells to benchmark the rank transformation on.

In [45]:
benchmark_expression_conversion(
    benchmark_output_dir=output_path,
    save_name=transformation_benchmarking_save_name,
    normalized_expression_matrix=adata.X,
    sample_size=1024,
)
Benchmarking with a sample dataset of size 1024
/ix/ccdg/storage3/til177/custom_miniconda/envs/cell2sentence/lib/python3.8/site-packages/plotnine/ggplot.py:587: PlotnineWarning: Saving 6.4 x 4.8 in image.
/ix/ccdg/storage3/til177/custom_miniconda/envs/cell2sentence/lib/python3.8/site-packages/plotnine/ggplot.py:588: PlotnineWarning: Filename: /ix/ccdg/storage3/til177/ParkLab/Project_C2S_scFM/code/tutorials/dominguez_immune_tissue_tutorial1/inverse_transformation_testing_tutorial_1_benchmark/normalized_expression_vs_log_rank.png
/ix/ccdg/storage3/til177/custom_miniconda/envs/cell2sentence/lib/python3.8/site-packages/plotnine/ggplot.py:587: PlotnineWarning: Saving 6.4 x 4.8 in image.
/ix/ccdg/storage3/til177/custom_miniconda/envs/cell2sentence/lib/python3.8/site-packages/plotnine/ggplot.py:588: PlotnineWarning: Filename: /ix/ccdg/storage3/til177/ParkLab/Project_C2S_scFM/code/tutorials/dominguez_immune_tissue_tutorial1/inverse_transformation_testing_tutorial_1_benchmark/reconstructed_expression.png

Now, we can retrieve the slope and intercept of the linear model which was fit to predict expression from rank

In [46]:
metrics_df = pd.read_csv(os.path.join(output_path, transformation_benchmarking_save_name + "_benchmark", "c2s_transformation_metrics.csv"))
metrics_df.shape
Out[46]:
(1, 11)
In [29]:
metrics_df
Out[29]:
experiment x_axis y_axis threshold slope intercept r_squared pearson_r_statistic pearson_r_pvalue spearman_r_statistic spearman_r_pvalue
0 Normalized Expression vs Log Rank log_rank_normalized_expression normalized_expression 3 -0.675689 2.237853 0.794836 0.929959 0.0 0.866874 0.0

We can see here the slope and intercept of the linear model which was fit on the log rank versus normalized expression on our sample of cells. Furthermore, we can see correlation statistics of the inverse reconstruction, where the linear model predicts the original expression based on the rank of the gene.

We can see that the linear model achieves 0.867 R^2, matching what was seen in the C2S paper. This indicates that most of the variance in the data is preserved when converting to rank-ordered cell sentences and then recovering the expression from rank. This allows us to utilize cell sentences and LLMs without worry about losing too much information when converting back to expression.

In [48]:
slope = metrics_df.iloc[0]["slope"]
intercept = metrics_df.iloc[0]["intercept"]
print("slope:", slope)
print("intercept:", intercept)
slope: -0.6756886579917628
intercept: 2.237853249084892

Reconstruct Cell Expression From Cell Sentences¶

To further see the ability of the linear model to reconstruct original gene expression from rank in the cell sentences, in this section we will reconstruct expression vectors from cell sentences and visualize them against the original data.

First, we need to create a list of the gene names in our vocabulary. This will determine the ordering of genes in the expression vector we reconstruct:

In [49]:
vocab_list = list(vocabulary.keys())
print(len(vocab_list))
vocab_list[:4]
23944
Out[49]:
['RP11-34P13', 'RP11-34P13-3', 'AP006222', 'LINC01409']

Now, we will first reconstruct a single expression vector:

In [50]:
print(len(cell_sentences_list))
print_first_N_genes(cell_sentences_list[0], top_k_genes=100)
29773
RPLP1 ACTB EEF1A1 HSP90AA1 TMSB4X B2M FTH1 KLF6 HSPA1B MALAT1 RPS12 HSPA8 RPL13 MT-CO1 ATF3 MT-CO2 RPL41 TPT1 MT-CO3 RPS19 HLA-B RPL10 RPS4X RPL28 MT-CYB DUSP1 RPL30 MT-ND4L RPS15 FOS RPL34 RPS2 RPLP2 MT-ND3 RPS18 RPS8 TRBV7-2 RPL32 RPS3 ANXA1 RPL11 HLA-C RPS27 ACTG1 UBC RPL3 RPL37 RPLP0 MT-ATP6 JUNB RPS28 RPL18 UBB MT-ATP8 RPS14 RPL39 PFN1 GAPDH HSPA1A RPL18A SRGN RPS27A RPL26 RPL19 RPS15A HLA-A DNAJB1 RPS3A CREM RPS13 MT-ND1 RPL21 RPS25 BTG2 RPL35A FAU RPL8 RPL7A RPS24 RPS6 RPS16 RACK1 NFKBIA RGS1 RPL29 CALM1 RPL9 RPL37A MT-ND5 TNFAIP3 RPS23 IL7R RPL36A PTMA NFKBIZ UBA52 EIF1 CRIP1 CORO1A RPL14
In [51]:
expression_vector = reconstruct_expression_from_cell_sentence(
    cell_sentence_str=cell_sentences_list[0],
    delimiter=" ",
    vocab_list=vocab_list,
    slope=slope,
    intercept=intercept,
)
In [52]:
print(type(expression_vector))
print(expression_vector.shape)
print(expression_vector.dtype)
<class 'numpy.ndarray'>
(23944,)
float32
In [53]:
expression_vector
Out[53]:
array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)
In [54]:
expression_vector.sum()
Out[54]:
599.0824
In [55]:
print(len(cell_sentences_list[0].split(" ")))
print(np.nonzero(expression_vector)[0].shape)
2191
(2191,)

We can see that the function reconstruct_expression_from_cell_sentence() has performed the inverse reconstruction on the cell sentence, using the rank of each gene in the cell sentence to predict its original expression using the linear model we fitted earlier:

  • predicted_expression = intercept + (slope * log(rank_of_gene))

We can now repeat this and reconstruct the entire original dataset:

In [56]:
all_reconstructed_expression_vectors = []
for idx in tqdm(range(len(cell_sentences_list))):
    expression_vector = reconstruct_expression_from_cell_sentence(
        cell_sentence_str=cell_sentences_list[idx],
        delimiter=" ",
        vocab_list=vocab_list,
        slope=slope,
        intercept=intercept,
    )
    all_reconstructed_expression_vectors.append(expression_vector)

all_reconstructed_expression_vectors = np.stack(all_reconstructed_expression_vectors)
100%|██████████| 29773/29773 [01:45<00:00, 281.77it/s]
In [57]:
all_reconstructed_expression_vectors.shape
Out[57]:
(29773, 23944)

Let's now make a new AnnData object, copying the .obs and .var from our original adata, but putting in our reconstructed expression vectors

In [58]:
all_reconstructed_expression_vectors = scipy.sparse.csr_array(all_reconstructed_expression_vectors)
all_reconstructed_expression_vectors
Out[58]:
<29773x23944 sparse array of type '<class 'numpy.float32'>'
	with 48389984 stored elements in Compressed Sparse Row format>
In [59]:
reconstructed_adata = ad.AnnData(
    X=all_reconstructed_expression_vectors,
    obs=adata.obs.copy(),
    var=adata.var.copy()
)
reconstructed_adata
Out[59]:
AnnData object with n_obs × n_vars = 29773 × 23944
    obs: 'cell_type', 'tissue', 'batch_condition', 'organism', 'assay', 'sex', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
    var: 'gene_name', 'ensembl_id', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'

Quickly verify that the original adata.var gene list ordering matches the vocab_list which we reconstructed vectors with:

In [60]:
adata.var.head()
Out[60]:
gene_name ensembl_id n_cells mt n_cells_by_counts mean_counts pct_dropout_by_counts total_counts
RP11-34P13 RP11-34P13 ENSG00000238009 38 False 38 0.001310 99.872368 39.0
RP11-34P13-3 RP11-34P13 ENSG00000241860 106 False 106 0.003627 99.643973 108.0
AP006222 AP006222 ENSG00000286448 7 False 7 0.000235 99.976489 7.0
LINC01409 LINC01409 ENSG00000237491 1292 False 1292 0.045981 95.660498 1369.0
FAM87B FAM87B ENSG00000177757 3 False 3 0.000101 99.989924 3.0
In [61]:
vocab_list[:5]
Out[61]:
['RP11-34P13', 'RP11-34P13-3', 'AP006222', 'LINC01409', 'FAM87B']

Plotting Reconstructed Expression Vectors¶

Now we will plot original data and reconstructed expression vectors side by side, to verify that the cell sentence transformation has preserved most of the original variance of the data.

First, we will remove the extra attributes of our original adata object, since we will need to create a new joint UMAP.

In [62]:
del adata.uns
del adata.obsm
del adata.varm
del adata.obsp
In [63]:
adata
Out[63]:
AnnData object with n_obs × n_vars = 29773 × 23944
    obs: 'cell_type', 'tissue', 'batch_condition', 'organism', 'assay', 'sex', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
    var: 'gene_name', 'ensembl_id', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
In [66]:
# Assign column specifying data origin before rowbind
adata.obs["c2s_data_label"] = ["Original Data"] * adata.obs.shape[0]
reconstructed_adata.obs["c2s_data_label"] = ["Reconstructed From Cell Sentences"] * reconstructed_adata.obs.shape[0]
In [67]:
combined_adata = ad.concat([adata, reconstructed_adata], axis=0)
combined_adata
/ix/ccdg/storage3/til177/custom_miniconda/envs/cell2sentence/lib/python3.8/site-packages/anndata/_core/anndata.py:1838: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Out[67]:
AnnData object with n_obs × n_vars = 59546 × 23944
    obs: 'cell_type', 'tissue', 'batch_condition', 'organism', 'assay', 'sex', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'c2s_data_label'
In [68]:
combined_adata.obs_names_make_unique()
In [69]:
combined_adata.var = adata.var.copy()
In [70]:
combined_adata.obs = combined_adata.obs[["cell_type", "tissue", "batch_condition", "organism", "sex", "c2s_data_label"]]
In [71]:
combined_adata
Out[71]:
AnnData object with n_obs × n_vars = 59546 × 23944
    obs: 'cell_type', 'tissue', 'batch_condition', 'organism', 'sex', 'c2s_data_label'
    var: 'gene_name', 'ensembl_id', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
In [72]:
combined_adata.obs.head()
Out[72]:
cell_type tissue batch_condition organism sex c2s_data_label
Pan_T7935490_AAACCTGCAAATTGCC CD4-positive helper T cell ileum A29 Homo sapiens female Original Data
Pan_T7935490_AAACGGGCATCTGGTA CD8-positive, alpha-beta memory T cell ileum A29 Homo sapiens female Original Data
Pan_T7935490_AAACGGGTCTTGCATT CD8-positive, alpha-beta memory T cell ileum A29 Homo sapiens female Original Data
Pan_T7935490_AAAGCAATCATCGCTC CD8-positive, alpha-beta memory T cell ileum A29 Homo sapiens female Original Data
Pan_T7935490_AAAGTAGCAGTCACTA gamma-delta T cell ileum A29 Homo sapiens female Original Data
In [73]:
combined_adata.obs.tail()
Out[73]:
cell_type tissue batch_condition organism sex c2s_data_label
Pan_T7935489_TTCTCCTCATCCAACA-1 CD8-positive, alpha-beta memory T cell sigmoid colon A29 Homo sapiens female Reconstructed From Cell Sentences
Pan_T7935489_TTGAACGAGTTGTCGT-1 T follicular helper cell sigmoid colon A29 Homo sapiens female Reconstructed From Cell Sentences
Pan_T7935489_TTGAACGCACGCCAGT-1 group 3 innate lymphoid cell sigmoid colon A29 Homo sapiens female Reconstructed From Cell Sentences
Pan_T7935489_TTGTAGGAGGGCATGT-1 germinal center B cell sigmoid colon A29 Homo sapiens female Reconstructed From Cell Sentences
Pan_T7935489_TTTACTGGTTGTGGCC-1 effector memory CD8-positive, alpha-beta T cel... sigmoid colon A29 Homo sapiens female Reconstructed From Cell Sentences
In [74]:
combined_adata.X.max()
Out[74]:
3.408124

We can now run PCA, Scanpy's neighbors algorithm, and then the UMAP algorithm:

In [75]:
sc.tl.pca(combined_adata)
computing PCA
    with n_comps=50
    finished (0:00:57)
In [76]:
sc.pp.neighbors(combined_adata)
computing neighbors
    using 'X_pca' with n_pcs = 50
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:27)
In [77]:
sc.tl.umap(combined_adata)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:44)
In [78]:
combined_adata
Out[78]:
AnnData object with n_obs × n_vars = 59546 × 23944
    obs: 'cell_type', 'tissue', 'batch_condition', 'organism', 'sex', 'c2s_data_label'
    var: 'gene_name', 'ensembl_id', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
    uns: 'pca', 'neighbors', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
In [79]:
combined_adata[combined_adata.obs["c2s_data_label"] == "Reconstructed From Cell Sentences", :]
Out[79]:
View of AnnData object with n_obs × n_vars = 29773 × 23944
    obs: 'cell_type', 'tissue', 'batch_condition', 'organism', 'sex', 'c2s_data_label'
    var: 'gene_name', 'ensembl_id', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
    uns: 'pca', 'neighbors', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
In [86]:
os.path.join(output_path, transformation_benchmarking_save_name+"_benchmark", "umap_comparison.png")
Out[86]:
'/ix/ccdg/storage3/til177/ParkLab/Project_C2S_scFM/code/tutorials/dominguez_immune_tissue_tutorial1/inverse_transformation_testing_tutorial_1_benchmark/umap_comparison.png'
In [87]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 4.5))
sc.pl.umap(
    combined_adata[combined_adata.obs["c2s_data_label"] == "Original Data", :],
    color="tissue",
    size=8,
    title="Original Human Immune Tissue Data",
    show=False,
    ax=ax1
)
sc.pl.umap(
    combined_adata[combined_adata.obs["c2s_data_label"] == "Reconstructed From Cell Sentences", :],
    color="tissue",
    size=8,
    title="Reconstructed From Cell Sentences",
    show=False,
    ax=ax2
)
plt.tight_layout()
# plt.savefig(os.path.join(output_path, transformation_benchmarking_save_name+"_benchmark", "umap_comparison.png"), 
#             dpi=300, bbox_inches="tight")
plt.show()
plt.close()
/ix/ccdg/storage3/til177/custom_miniconda/envs/cell2sentence/lib/python3.8/site-packages/anndata/_core/anndata.py:1230: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
/ix/ccdg/storage3/til177/custom_miniconda/envs/cell2sentence/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
/ix/ccdg/storage3/til177/custom_miniconda/envs/cell2sentence/lib/python3.8/site-packages/anndata/_core/anndata.py:1230: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
/ix/ccdg/storage3/til177/custom_miniconda/envs/cell2sentence/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
No description has been provided for this image

We can see that our conversion to cell sentences and back to expression vectors has preserved the structure of the data, due to the accuracy of the linear model in predicting expression from the rank information. This means we can train LLMs and perform single-cell tasks with them, and once we have predicted cell sentences, we can always convert them back to expression vectors whenever necessary for further analysis!